Hi, everyone. Welcome back to the Heterogeneous Parallel Programming class. We are at Lecture 1.7, kernel-based parallel programming, and today we will be discussing multidimensional kernel configuration.

The objective of this lecture is for you to understand the use of multidimensional thread grids. In particular, we are going to discuss the use of multidimensional block and thread indices, and how we can map these block and thread indices to data indices in order to process two-dimensional pictures and so on. After this lecture, you should be able to easily generalize all of these concepts to the processing of three-dimensional data structures using three-dimensional blocks and grids.

This picture was shown earlier in the introductory lecture, where we assumed that the host has launched kernel 1, and this generates a grid of threads. We assume the user has specified a two-dimensional grid, which is reflected in the two-dimensional indices of the blocks, and that each block is organized as a three-dimensional arrangement of threads, so each thread in this picture has three indices. The convention we will use is that for a two-dimensional structure we show the x and y indices: the first index is x and the second index is y. For a three-dimensional structure, we show the indices in the order x, y, and z, so the first index is x, the second is y, and the third is z.

That was a very simple example, and now we are ready to look at a more practical one. Here we see a two-dimensional picture of pixels: 62 pixels in the y dimension and 76 pixels in the x dimension. We will call this a 62 by 76 picture, which means that the first number in the configuration is the number of pixels in the y dimension and the second number is the number of pixels in the x dimension. We assume that we are going to use a 16 by 16 thread block, that is, 16 threads in the y dimension and 16 threads in the x dimension, to process the picture, just like in the vector addition example.
We need to make sure that we have enough threads in both the y dimension and the x dimension to process all the pixels. So we will need to launch four thread blocks in the y dimension to cover all 62 pixels: four thread blocks give us 64 threads, so all 62 pixels are covered, and we will have two extra threads in the y dimension that should not do anything. In the x dimension we will need to launch five thread blocks, which gives us 80 threads, so we cover all 76 pixels and we will have four extra threads that should not do anything. This will be reflected in the way we write the kernel.

Before we look at the kernel code, I'd like to quickly review the way C and C++ lay out a two-dimensional matrix, or two-dimensional array, in the linear address space. In modern computers, all memory data are stored in a linear address space. Even though conceptually we have a two-dimensional array in C, ultimately all the elements are stored in a linear order; this is called the layout, and C adopts the row-major layout convention. This means that all the elements in a row keep their relative positions: all the neighboring elements in row zero are placed consecutively in the linearized memory, and all the elements in row one are also placed in consecutive locations. So whenever we need to access an element using a linearized address, we can generate that address by multiplying the row index by the width of the array and adding the column index. For example, suppose we want to access element M(2,1); we have a row index of 2 and a column index of 1. We multiply the row index 2 by the width, which is 4, and then add the column index, which is 1, and this gives us 9. If you look at the top of the picture, we show the linearized index for all the M elements: the first four are 0, 1, 2, 3, the next ones are 4, 5, 6, 7, and so element 9 is indeed the original M(2,1).
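To make the calculation concrete, here is a minimal sketch of the row-major address computation just described. The macro name IDX and the variable names are illustrative, not from the lecture slides; the sketch simply encodes linear index = row times width plus column.

```c
#include <stdio.h>

/* Row-major linearization: element (row, col) of a `width`-column array. */
#define IDX(row, col, width) ((row) * (width) + (col))

int main(void) {
    int width  = 4;                   /* 4 elements per row, as in the example */
    int linear = IDX(2, 1, width);    /* element M(2,1): row 2, column 1 */
    printf("M(2,1) maps to linear index %d\n", linear);   /* prints 9 */
    return 0;
}
```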
This row-major linearization is what we will use for accessing all the dynamically allocated arrays in CUDA kernels. Even though this is familiar to all of you, it is very important to keep it in mind when you look at any CUDA kernel. For those of you who come from a Fortran background, Fortran adopts the column-major layout convention, where all the elements in a column have their relative positions preserved. In the C layout, by contrast, adjacent elements in a column are laid out in a strided way, so that they are a full row width away from each other, whereas in a Fortran program all the elements in the same column would be placed in adjacent locations.
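As a small, hedged sketch of that stride (the width and element positions are illustrative, not from the slides), the following shows that in C's row-major layout, neighbors within a row are one element apart while neighbors within a column are a full row width apart:

```c
#include <stdio.h>

int main(void) {
    const int Width = 4;   /* 4 elements per row, as in the lecture's example */

    /* Linear index of element (row, col) in C's row-major layout: row * Width + col. */
    int rowNeighborGap = (1 * Width + 2) - (1 * Width + 1);   /* M(1,2) vs M(1,1) -> 1     */
    int colNeighborGap = (2 * Width + 1) - (1 * Width + 1);   /* M(2,1) vs M(1,1) -> Width */

    printf("row neighbors:    %d element apart\n",  rowNeighborGap);
    printf("column neighbors: %d elements apart\n", colNeighborGap);
    return 0;
}
```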
Now that we understand how C and C++ linearize a two-dimensional array, we are ready to look at a piece of kernel code. Here is PictureKernel, a very simple kernel that multiplies every pixel value in a picture by 2.0, essentially magnifying the value of all the pixels in the picture. The kernel accepts four arguments: the first is a pointer to the input picture, the second is a pointer to the output picture, the third is the number of pixels in a row, that is, in the x dimension, and the fourth is the number of pixels in the vertical, or y, dimension.

When we determine which pixel a particular thread should process, we have a y index generation and an x index generation. The y index generation produces a row index; remember that the row index selects one of the rows along the y dimension, which is why we use the y indices. We just use our familiar pattern of blockIdx.y times blockDim.y plus threadIdx.y, and this ensures that every possible row has a thread that covers it. Then we have the same pattern for the x dimension. This is pretty much identical to the vector addition case, where every element in the x dimension will have a thread to cover that particular pixel value.

Now that we have generated the appropriate row index and column index, all the threads can use them to do the processing. We first test whether the thread is within the valid range: we test whether the row index is less than m, where m is the number of pixels in the y dimension, so the row index needs to be within that range, and we also test whether the column index is less than n, where n is the number of pixels in the x dimension. Whenever both conditions are satisfied, we know that the thread is within the valid range of the two-dimensional area where there are valid pixels. The thread can then proceed to multiply the input pixel by two and assign the result to the output. Here you see the linearized addressing: the thread uses row multiplied by n, the number of pixels in each row, which is the width, plus the column index, and we use exactly the same expression to write into the output array. This allows every thread to pick the pixel at its assigned row and column position, multiply it by 2, and write it into the output array.
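The following is a minimal sketch of the kernel as just described, assuming the argument order given in this lecture (input pointer, output pointer, n pixels in x, m pixels in y); the exact names on the lecture slide may differ.

```cuda
// Sketch of the picture-scaling kernel described above.
// d_Pin/d_Pout are the input/output pictures; n is the x (width) pixel count, m the y (height) count.
__global__ void PictureKernel(float* d_Pin, float* d_Pout, int n, int m)
{
    // Row index: selects one of the m rows, generated from the y indices.
    int Row = blockIdx.y * blockDim.y + threadIdx.y;
    // Column index: selects one of the n columns, generated from the x indices.
    int Col = blockIdx.x * blockDim.x + threadIdx.x;

    // Boundary check: only threads that map to a valid pixel do any work.
    if (Row < m && Col < n) {
        // Row-major linearized address: row * width + column.
        d_Pout[Row * n + Col] = 2.0f * d_Pin[Row * n + Col];
    }
}
```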
Here we show the host code that launches PictureKernel. We assume that in the host code the variable m holds the number of pixels in the y dimension and the variable n holds the number of pixels in the x dimension. In our 62 by 76 example, m will be equal to 62 and n will be equal to 76. Now we are going to launch the kernel, and this is how we make sure that the kernel will be executed by enough threads to cover both the y dimension and the x dimension. We take the n value and apply essentially a ceiling function to n divided by 16, using the familiar expression (n - 1)/16 + 1, for the x dimension, and we do the same thing for the y dimension, (m - 1)/16 + 1. Within each block we will have 16 threads in the x dimension, 16 threads in the y dimension, and one thread in the z dimension. This gives us a two-dimensional grid of two-dimensional blocks to cover the given picture. We then use this configuration to launch the kernel: we call PictureKernel with DimGrid and DimBlock, and we give it the four arguments, the pointer to the input picture, the pointer to the output picture, n, and m. This ensures that the kernel will have enough threads to cover both dimensions, and in the kernel we will be able to test whether each thread should be acting on a pixel, or whether the thread is outside the valid range and should not take any action.
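Here is a minimal sketch of that host-side configuration and launch, assuming the device pointers d_Pin and d_Pout have already been allocated and filled (those steps are not shown in this lecture):

```cuda
// Host-side launch configuration for PictureKernel, as described above.
// m = 62 pixels in y, n = 76 pixels in x; 16 x 16 thread blocks.
int m = 62;
int n = 76;

// Ceiling division: enough blocks in x to cover n columns and in y to cover m rows.
dim3 DimGrid((n - 1) / 16 + 1, (m - 1) / 16 + 1, 1);   // 5 x 4 x 1 blocks
dim3 DimBlock(16, 16, 1);                               // 16 x 16 x 1 threads per block

// Launch with the four arguments: input pointer, output pointer, n (x), m (y).
PictureKernel<<<DimGrid, DimBlock>>>(d_Pin, d_Pout, n, m);
```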
This picture shows a little more analysis of what actually happens at run time. We launched four thread blocks by five thread blocks, so we actually end up launching 20 thread blocks, and there are four different situations as far as the execution of the thread blocks is concerned. The first case is in the upper-left corner, where all the threads have both their y index and their x index in the valid range; all the threads within these three-by-four, that is twelve, thread blocks fully execute and fully process the picture. The second case is on the upper-right side: these threads find their y indices within the valid range, but only some of their x indices are in the valid range and some are not, so some of the threads at the right end of these thread blocks will not execute. In the third situation, all the threads find their x indices in the valid range, but some of the threads find their y index outside the valid range, so these thread blocks will have some of their threads, at the bottom of the block, not processing any pixels. Finally, we have the fourth case, where some of the threads find their x index outside the valid range and some find their y index outside the valid range; in fact, some of the threads in the very corner, the lower-right corner, find both their x and y indices outside the valid range.

All of these situations will happen, and this is an important picture for you to remember when we begin to discuss the performance implications. For now, it suffices to remember that whenever we process a two-dimensional picture and use a boundary check to test whether each thread has its y index and x index within the valid range, we end up with these four possible cases.

This brings us to the conclusion of this lecture. For those of you who would like to understand more about the material, I encourage you to read Sections 4.1 and 4.2 of the textbook. Thank you.